conjecture 3
- North America > United States (0.28)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
DeMo: Decoupled Momentum Optimization
Peng, Bowen, Quesnelle, Jeffrey, Kingma, Diederik P.
Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary. By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers. We introduce Decoupled Momentum (DeMo), a fused optimizer and data-parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware. Our method is topology-agnostic and architecture-independent, and it supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large-scale foundation models. An open-source reference PyTorch implementation is published on GitHub at https://github.com/bloc97/DeMo.
- North America > United States (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
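To make the mechanism in the DeMo abstract concrete, here is a minimal single-tensor sketch of the frequency-decomposition idea it describes. This is not the reference implementation (see the GitHub link above): the function name, the use of a plain DCT over a flat parameter vector, and the top-k coefficient selection are illustrative assumptions.

```python
# Hypothetical sketch of decoupled momentum with frequency decomposition.
import numpy as np
from scipy.fft import dct, idct

def decoupled_momentum_step(momentum, grad, k, beta=0.99):
    """Accumulate grad into local momentum, then split off the k
    highest-energy DCT components as the only part that would be
    communicated; the remainder stays local and is allowed to diverge."""
    momentum = beta * momentum + grad        # local momentum accumulation
    coeffs = dct(momentum, norm="ortho")     # frequency decomposition
    top = np.argsort(np.abs(coeffs))[-k:]    # energy compaction: keep top-k
    fast = np.zeros_like(coeffs)
    fast[top] = coeffs[top]
    shared = idct(fast, norm="ortho")        # the part to exchange with peers
    momentum = momentum - shared             # decouple: drop what was sent
    return momentum, shared
```

In a data-parallel run, only the k selected coefficients behind `shared` would cross the interconnect, which is where the claimed orders-of-magnitude reduction in communication would come from.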
KernelSHAP-IQ: Weighted Least-Square Optimization for Shapley Interactions
Fumagalli, Fabian, Muschalik, Maximilian, Kolpaczki, Patrick, Hüllermeier, Eyke, Hammer, Barbara
The Shapley value (SV) is a prevalent approach for allocating credit to machine learning (ML) entities to understand black-box ML models. Enriching such interpretations with higher-order interactions is indispensable for complex systems, where the Shapley Interaction Index (SII) is a direct axiomatic extension of the SV. While it is well known that the SV yields an optimal approximation of any game via a weighted least-squares (WLS) objective, an extension of this result to SII has been a long-standing open problem, which even led to the proposal of an alternative index. In this work, we characterize higher-order SII as a solution to a WLS problem, which constructs an optimal approximation via SII and $k$-Shapley values ($k$-SII). We prove this representation for the SV and pairwise SII and give empirically validated conjectures for higher orders. As a result, we propose KernelSHAP-IQ, a direct extension of KernelSHAP for SII, and demonstrate state-of-the-art performance for feature interactions.
- Europe > Austria > Vienna (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Europe > Netherlands (0.04)
- (2 more...)
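Since the abstract builds on the classical result that the SV solves a WLS problem, a small exact sketch of that baseline may help; extending the design matrix with pairwise columns is, roughly, the direction KernelSHAP-IQ takes for SII. The helper name, the toy game, and the large boundary weights (used in place of hard efficiency constraints) are assumptions, not the paper's construction.

```python
# Exact KernelSHAP-style WLS solve for plain Shapley values (toy scale).
from itertools import combinations
from math import comb
import numpy as np

def shapley_via_wls(value, n):
    """Shapley values of an n-player game `value: frozenset -> float`,
    recovered as the solution of a weighted least-squares problem."""
    rows, targets, weights = [], [], []
    for size in range(n + 1):
        for S in combinations(range(n), size):
            x = np.zeros(n + 1)
            x[0] = 1.0                                  # intercept for v({})
            x[1 + np.array(S, dtype=int)] = 1.0         # coalition indicator
            rows.append(x)
            targets.append(value(frozenset(S)))
            if 0 < size < n:                            # Shapley kernel weight
                weights.append((n - 1) / (comb(n, size) * size * (n - size)))
            else:                                       # soft efficiency constraint
                weights.append(1e7)
    X, y = np.array(rows), np.array(targets)
    sw = np.sqrt(np.array(weights))
    phi, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return phi[1:]                                      # drop the intercept

# Toy 3-player game v(S) = |S|**2: symmetry gives a Shapley value of 3 each.
print(shapley_via_wls(lambda S: len(S) ** 2, 3))
```

Because all $2^n$ coalitions are enumerated here, the WLS solution matches the exact SV; KernelSHAP's practical value, and by extension KernelSHAP-IQ's, comes from solving the same objective on a sampled subset of coalitions.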
An Analysis of the Expressiveness of Deep Neural Network Architectures Based on Their Lipschitz Constants
Zhou, SiQi, Schoellig, Angela P.
Deep neural networks (DNNs) have emerged as a popular mathematical tool for function approximation due to their capability of modelling highly nonlinear functions. Their applications range from image classification and natural language processing to learning-based control. Despite their empirical successes, there is still a lack of theoretical understanding of the representative power of such deep architectures. In this work, we provide a theoretical analysis of the expressiveness of fully-connected, feedforward DNNs with 1-Lipschitz activation functions. In particular, we characterize the expressiveness of a DNN by its Lipschitz constant. By leveraging random matrix theory, we show that, given sufficiently large and randomly distributed weights, the expected upper and lower bounds of the Lipschitz constant of a DNN, and hence its expressiveness, increase exponentially with depth and polynomially with width, which gives rise to the benefit of depth in DNN architectures for efficient function approximation. This observation is consistent with established results based on alternative expressiveness measures of DNNs. In contrast to most of the existing work, our analysis based on the Lipschitz properties of DNNs is applicable to a wider range of activation nonlinearities and potentially allows us to make sensible comparisons between the complexity of a DNN and that of the function to be approximated by the DNN. We consider this work to be a step towards understanding the expressive power of DNNs and towards designing appropriate deep architectures for practical applications such as system control.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
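The exponential-in-depth behaviour in the abstract above is easy to see numerically through the standard spectral-norm product, which upper-bounds the Lipschitz constant of a network with 1-Lipschitz activations. The width, depths, and Gaussian scaling below are illustrative assumptions, not the paper's exact setting.

```python
# Growth of the spectral-norm Lipschitz upper bound with depth.
import numpy as np

rng = np.random.default_rng(0)
width, trials = 64, 20
for depth in (1, 2, 4, 8):
    bounds = []
    for _ in range(trials):
        prod = 1.0
        for _ in range(depth):
            # Entries ~ N(0, 1/width), a common random initialization.
            W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
            prod *= np.linalg.norm(W, 2)   # spectral norm of the layer
        bounds.append(prod)
    print(f"depth {depth}: mean Lipschitz upper bound ~ {np.mean(bounds):.2f}")
```

With this scaling each layer's spectral norm concentrates near 2, so the bound grows roughly like 2^depth, illustrating the exponential-in-depth trend the abstract reports.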
Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences
Recht, Benjamin, Re, Christopher
Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress toward theoretically evaluating the difference in performance between sampling with- and without-replacement in such algorithms. Focusing on least-mean-squares optimization, we formulate a noncommutative arithmetic-geometric mean inequality that would prove that the expected convergence rate of without-replacement sampling is faster than that of with-replacement sampling. We demonstrate that this inequality holds for many classes of random matrices and for some pathological examples as well. We provide a deterministic worst-case bound on the discrepancy between the two sampling models, and explore some of the impediments to proving this inequality in full generality. We detail the consequences of this inequality for stochastic gradient descent and the randomized Kaczmarz algorithm for solving linear systems.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.69)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
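The with- versus without-replacement gap from the last abstract is easy to probe empirically with the randomized Kaczmarz method it mentions. Everything below (problem size, epoch count, seed) is an illustrative assumption; without-replacement usually wins in such runs, which is what the conjectured inequality would explain.

```python
# Randomized Kaczmarz: with- vs. without-replacement row sampling.
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 80
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true                                # consistent linear system

def kaczmarz_epoch(x, order):
    for i in order:
        a = A[i]
        x = x + (b[i] - a @ x) / (a @ a) * a  # project onto row i's hyperplane
    return x

x_with = np.zeros(d)
x_without = np.zeros(d)
for _ in range(5):                            # five epochs of n updates each
    x_with = kaczmarz_epoch(x_with, rng.integers(0, n, size=n))
    x_without = kaczmarz_epoch(x_without, rng.permutation(n))
print("with replacement:   ", np.linalg.norm(x_with - x_true))
print("without replacement:", np.linalg.norm(x_without - x_true))
```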